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In the context of adaptive Monte Carlo algorithms, we cannot directly gen- 
erate independent samples from the distribution of interest but use a proxy 
which we need to be close to the target. Generally, such a proxy distribution 
is a parametric family on the sampling spaces of the target distribution. For 
continuous sampling problems in high dimensions, we often use the multi- 
variate normal distribution as a proxy for we can easily parametrize it by 
its moments and quickly sample from it. The objective is to construct sim- 
ilarly flexible parametric families on binary sampling spaces too large for 
exhaustive enumeration. 

Keywords Binary parametric families • Sampling correlated binary data 



^ : 1 Introduction 

in , 

lO . 1-1 Parametric families for Monte Carlo 

O 

' We discuss parametric families on binary spaces against the backdrop of Monte Carlo 

00 ■ applications. The construction of binary parametric families qg that can model and 

reproduce the dependence structure of the target distribution ir is a difficult task, and 
many concepts of modeling multivariate binary data fail to provide parametric families 
that are suitable for adaptive Monte Carlo algorithms. Therefore, we do not only discuss 
workable families but also approaches that are impractical in order to provide a thorough 
review of all available methods. 



1.2 Notation 

We denote scalars in italic type, vectors in italic bold type and matrices in straight bold 
type. We write diag [a] for the diagonal matrix of the vector a and diag [A] for the main 
diagonal of the matrix A. The determinant is denoted by det [A]. We write a*, and a m j 
for the ith row and jth column of A, respectively. We write A >- to indicate that A 
is positive definite. Given as set M, we write \M\ for the number of its elements, M for 
its closure and 1m for its indicator function. 
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We write B = {0, 1} for the binary space and denote by d G N the generic dimension. 
Given a vector 7 GB 11 and an index set I C {1, . . . , d}, we write 7/ G B^ for the sub- 
vector indexed by / and -f-j G M d ~^\ for its complement. If I is a sequence {?,••■ ,j} 
we use the more explicit notation instead of 7/ and 7^ if / = {i}. 

We write 7/ x and 7/ for 7 with its components indexed by I set to 1 and 0, respec- 
tively. In particular, we frequently use the short notation aj.7^ for an + ^}=i a «j7j 
where A is a lower triangular matrix. 

1.3 Data from the target distribution 

In the sequel, let d > denote the dimension of the binary space M d = {0, l} d . Adaptive 
Monte Carlo algorithms are generally able to produce a, not necessarily independent 
and possibly weighted, sample 

w = (twi, . . . ,0 G [0, X = (a*, . . . , x n y G B nxd 

from the target distribution 7r we want to emulate using a binary family. We define the 
index set D = {1, . . . , d) and denote by 

x i : = Y^k=l W k x k,i, Xij ■= Ylk=l W k x k,iXk,j, hj G D (1) 

the weighted first and second sample moments. We further define by 

rp . . rp . ry . 

nr-= 7 - n n -v iJeD ' (2) 

V x i\^ ~ x i ) x j V 1 — -^jj 

the weighted sample correlation. 

1.4 Suitable parametric families 

We first frame some properties making a parametric family suitable as sampling distri- 
bution in adaptive Monte Carlo algorithms. 

(a) For reasons of parsimony, we want to construct a family of distributions with at 
most dim(#) < d(d+ l)/2 parameters. 

(b) Given a sample X = (aci, . . . , x n ) J from the target distribution ir, we need to estimate 
6* such that the binary family qg* is close to ir. 

(c) We need to generate samples Y = (j/i, . . . ,y m ) J from the family qg. We need the 
rows of Y to be independent. 

(d) For some algorithms, we need to evaluate the probability qg{y). For instance, we 
need qg(y) to compute importance weights or acceptance ratios in the context of 
Importance Sampling or Markov chain Monte Carlo, respectively. 

(e) Analogously to the multivariate normal, we need our calibrated binary family qg* to 
reproduce the marginals and covariance structure of it. 



2 Distributions on binary spaces 

Before we embark on the discussion of binary families, we make some observations which 
hold true for every binary distribution. The notation and results introduced in this 
section will be used throughout the rest of this work. Here, we denote by tt some generic 
distribution on M d 
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Moments We use the short notation, 

for the product of all components index by / with n«e0 = 1- Since U](~f) = 1 iff -ji = 1 
for all i E I, uj is the indicator function for the unit vector We can characterize 
every distribution on M d by 2 d — 1 full probabilities 

P/:=P w (7/ = 1,7d\j = 0), IQD 

or by 2 d — 1 cross- moments, that is marginal probabilities, 

mi :=E n ( Ul (j))=F n (~ ri = l), I CD. 

In the following, we assume that rrii G (0, 1) for all i G D, since for € {0, 1}, the 
component 7$ = im is constant and therefore not part of the sampling problem. 

For the product of components normalized to have zero mean and unit variance, we 
write 



v i(l) ■= ]liei(7i - rni)/\/mi(l-mi), I CD. 
Note that (vij) is the correlation between ji and jj. Therefore, we call 

CI :=E W («j( 7 )) 

the correlation of order |/|. 
Marginals We use the notation 

for the marginal distributions. Note the connection to the cross-moments 

7rj(l|/|) = EeeB^I^I ^(il/h €) = E 7 eBd u K7) tt(7) = m 7 . (3) 

Representations Let ir be the mass function of a binary distribution and suppose there 
is a bijective mapping t:131/-> ir(M d ). There are coefficients a/ G R such that 

7r (7) = r \Eicd a i Uiei 7iJ ■ (4) 
Proof. Immediate from the representation of the Dirac delta function as a product, 

^(7) = t [E/cd^(7)t~ i (^(k / ))] , ^1(7) = ni 6 /7irii e {i,...,d}\j( i -7<), 

where denotes the vector with k[ = for all i € {1, ... , (i}. □ 
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Constraints The general constraints on binary data are 

(£»6l m-i - |/| + l) V < m 7 < min {m K \ K C J} , (5) 

where the upper bound is the monotonicity of the measure, and the lower bound follows 
from 

l^-i = E^(W-iM7) 

> E 7G B d (Eie/7i - «j(7)) t(7) 

In fact, m/ is a |J (-dimensional copula with respect to the expectations m; for i E I, 
see Nelsen (2006, p. 45), and the inequalities (5) correspond to the Frechet-Hoeffding 
bounds. 

Sampling For sampling from a binary distribution tt, we apply the chain rule factor- 
ization 



T(7) = 7T{l}(7l) nf=2 7r {l:i}(7» I 7l:i-l) 

, (6) 

= ^{1} (7l) Ui=2 K{V.i-l} {lV.i-l) (7l:i) 



which permits to sample a random vector component- wise, conditioning on the entries 
we already generated. We do not even need to compute the full decomposition (6), but 
only the conditional probabilities irsi-a^i = 1 I li-.i-i) defined by 

7r{l:i}(7l:i-l,l) ^ 



7r{l:t}(7l:i-l5 !) + 7f{l:t}(7l:t-l, 0) ' 
The full probability 7r(7) is then computed as a by-product of Procedure 1. 

Procedure 1 Sampling via chain rule factorization 

y = (0,...,0), p<-l 
for i = 1 . . . , d do 

r <~ 7T{ 1:i }(7i = 1 | 7l:i-l) 

sample u ~ W[ 0>1] , j/j <- l[ , r ](u) 

I p ■ r if yi = 1 

p ■ (1 - r) if 2/i = 



P 



end for 
return y, p 



3 The product family 

The simplest non-trivial distributions on M d are certainly those having independent 
components. 
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3.1 Definition 

For a vector m G (0, l) d of marginal probabilities, we define the product family 

The second representation using the logit function 

£: (0, 1) R, £(p) = logp-log(l -p) (9) 
is useful to identify the product family as special case of more complex families. 

3.2 Properties 

We check the requirement list from Section 1.4: 

(a) The product family is parsimonious with dim((9) = d. 

(b) The maximum likelihood estimator m* is the sample mean (1). 

(c) We easily sample from q^° d , since (6) holds trivially. 

(d) We easily evaluate the probability of a product of independent components. 

(e) The family (7^° d does not reproduce dependencies we might observe in the data X. 

The last point is a weakness which makes this simple family impractical when adaptive 
Monte Carlo algorithms are applied to challenging sampling problems. The product 
family q^ od is might often fail to mimick the target distribution tt sufficiently well. 
Therefore, the rest of this paper deals with ideas on how to sample binary vectors with 
a given dependence structure. 

3.3 Beyond the product family 

There are, to our knowledge, two main strategies to produce binary vectors with corre- 
lated components. 

(1) We can construct a generalized linear family which permits computation of its 
marginal distributions. We apply the chain rule factorization (6) and write qg as 

qeil) = le(li) nf= 2 Qe(n I H-.i-i), (10) 
which allows us to sample vectors component-wise. 

(2) We sample from a multivariate auxiliary distribution hg dichotomize the samples, 
that is map them into M d . We call 

qe{l) = f T - lh) h {v)dv (11) 

a copula family since we exploit the copula structure of the underlying distribution 
to build a new parametric family. However, we refrain from working with explicit 
uniform marginals which is not all necessary (Mikosch, 2006). 

In the following, we first study a few generalized linear families and then review a some 
copula approaches. 
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4 The linear quadratic family 

Taking r the identity mapping in (4), we obtain a full linear representation 

^(7) = T,icD a i u i(-y)- 

However, we cannot give a useful interpretation of the coefficients a/. Bahadur (1961) 
derived the following representation: 

Proposition 4.1. We can write any binary distribution as 

^(7) = C° d (7) (EicW7)cj), 
where m = (mi, . . . ,md) are the marginal probabilities. 

Proof. For convenience, we provide the proof of Bahadur in Appendix 10. □ 

This decomposition, first discovered by Lazarsfeld, is a special case of a more general 
interaction theory (Streitberg, 1990) and allows for a reasonable interpretation of the 
parameters. Indeed, we have a product family times a correction term l+£/ e j fc w/(t) cj 
where the coefficients are higher order correlations. 



4.1 Definition 

We can try to construct a more parsimonious family by removing higher order inter- 
action terms. For additive approaches, however, we face the problem that a truncated 
representations do not necessarily define probability distributions since they might not 
be non- negative. 

Still, for a symmetric matrix A, we define the d{d + l)/2 parameter family 

<ao U (7)=M«o+7 T A 7 ), (12) 

where fi > is a normalizing constant and we set ao = — (min_, eB d "f J A-f A 0). Since 
oq is the solution of an NP hard quadratic unconstrained binary optimization problem, 
this definition is of little practical value. 

4.2 Moments 

In virtue of the linear structure, we can derive explicit expressions for the cross-moments 
and marginal distributions, explicit meaning that the complexity is polynomial in d. The 
proofs are basic but rather tedious, so we moved them to the appendix section. 

Next, we give a general formula yielding all cross-moments, including the normalizing 
constant. 

Proposition 4.2. For a set of indices I C D, we can write the corresponding cross- 
moment as 

1 zZiel 2 J2jeD a i,j + J2jei\{i} a i,j 
m i = ^nr + 



2l J l 2l / l(4a + lTAl + tr[A]) 

For a proof see Appendix 10 
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Corollary 4.3. The normalizing constant is 

H = 2~ d+2 (4a + 1 T A1 + tr [A])" 1 , 

and the expected value is 

iff / \ 1 j T^k=l a i,k 

JtL LinQu 7,' ) = ; : r— T . 

iA,a Knj 2 4a + 1 T Al + tr [A] 

The mean rrii is close to 1/2 unless the row a.; dominates the matrix. Therefore, if 
A is non-negative definite, the marginal probabilities mi can hardly take values at the 
extremes of the unit interval. 



4.3 Marginals 

For the marginal distributions 



there are explicit and recursive formulas. Hence, we can compute the chain rule decom- 
position (6) which in turn allows to sample from the family. 

Proposition 4.4. For the marginal distribution holds 



OtI:*) = /^- fc - 2 Sfc ( 7 l: fc ) 



where 



Skhi-.k) = 4a + ELi 7i (EjU TjQij + Ej=fc+i 

For a proof see Appendix 10 

Recall the connection between marginal distributions and moments we observed in 
(3). For 7/ = 1 we obtain 

s/(l fe ) = 4a + 4 E<e/(Ej6/ a ^ + ^je/ c a *j) 

= 4a + EieD YsjeD a i,j + EieD a M + 3 Yjiel ^2jei a M 

= 4a + 1 T Al + tr [A] + 

Yliei ^ YljeD a M + Yljei\{i} a i,j > 

and 71/(1*;) = /i2 d_ ' / ' _2 s/(lfc) is indeed the expression for the cross-moments in Proof of 
Proposition 4.2. 
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4.4 Fitting the parameter 

Given a sample X = {x\, . . . ,x n ) J ~ ir from the target distribution, we can determine 
ao and a matrix A such that the family q^^o n ^ s ^ e nrs ^ an< ^ secon d sampling moments 

x {i,j} = n 1 X/fc=l x k,i x k,ji hj S -D 

by solving a linear system of dimension d(d + l)/2 + 1. We first use the bijection 

7-:I>x£>->.{l,...,d(d + l)/2}, T(i,j)=i(i-l)/2+j 

to map symmetric matrices into R( rf+1 ) d / 2 . Precisely, for the matrices A and X, we 
define the vectors 

and the design matrix 

3 — r > l {ij} (k)+t {ijk} Q) 

b r(i,j),T(k,l) ■— z 1 1 x 1 ■ 

Note that |d| = 1 T A1 +tr [A]. We then equate the distribution moments to the sample 
moments and normalize such that 

2 d ~ 2 (Ia + -So) = x, 2 d " 2 (4a + \a\) = 1. (13) 

The solution of the linear system 

'(i) 

is finally transformed back into a symmetric matrix A*. Since the design matrix does 
not depend on the data, fitting several parameters to different data on the same space 
K d is extremely fast. 

4.5 Properties 

We check the requirement list from Section 1.4: 

(a) The linear family is sufficiently parsimonious having dimension dim(0) = d{d+\)/2. 

(b) We can fit the parameters A and ao via method of moments. However, the fitted 
function ^'"^."(t) is usually not a distribution. 

(c) We can sample via chain rule factorization. 

(d) We can evaluate c^^ u (y) via chain rule factorization while sampling. 

(e) The family reproduces the mean and correlations of the data X. 

Since in applications, the fitted matrix A* is hardly ever positive definite, we cannot 
use the linear family in an adaptive Monte Carlo context. As other authors (Park et al., 
1996; Emrich and Piedmonte, 1991) remark, additive representations like Proposition 
4.1 are instructive but we cannot derive practical families from them. 



a 



-d+2 



i S 
4 ° 

4 IT 



8 



5 The exponential quadratic family 



If vr(7) > for all 7 € M d , we can use r = exp in (4) and obtain a full log-linear 
representation 

7r(7) = exp (j2icD a i Mlfi) ■ 

Note that we assume the probability mass function ir is assumed to be log-linear in 
the parameters a/. In the context of contingency tables the term "log-linear family" 
refers to the assumption that the marginal probabilities mi are log-linear in the higher 
order marginals. 

Remark Contingency table analysis is a well studied approach to modeling discrete 
data (Bishop et al., 1975; Christensen, 1997). For binary data, the underlying sampling 
distribution is assumed to be multinomial which requires an enumeration of the state 
space we want to avoid. Gange (1995) uses the Iterative Proportional Fitting algorithm 
(Haberman, 1972) from log-linear interaction theory to construct a binary distribution 
with given marginal probabilities. The fitting procedures, however, require storage of all 
configurations 717(77) and the construction of the joint posterior from the fitted marginal 
probabilities. The method is powerful and exact but computationally infeasible even for 
moderate dimensions. 

5.1 Definition 

Removing higher order interaction terms, we can construct a d(d+l)/2 parameter family 

^ xpQu (7):=^exp( 7 TA7), (14) 

where A is a symmetric matrix and \i := [^7eB d exp(7 T A7)]~ 1 . We recognize the 
product family (8) as the special case \i = YiieD^ ~ m i) d an d A = diag [£(m)]. 

5.2 Marginals 

The moments or marginal distributions of q& are sums of exponentials which, in general, 
do not simplify to expressions that are polynomial in d. Therefore, we cannot perform 
a chain rule factorization (6) to sample from the family. 

Cox and Wermuth (1994) proposed the following second degree Taylor approximations 
to the marginal distributions which are again of the form (14). 

Proposition 5.1. We write the parameter A as 

*-(?*')• < 15 > 

and define the parameters 

A d _! = A' + (1 + tanh(§)) diag [b] + \ sech 2 (§)&6T, 
fid-i = M(l + ex P( c )) 

Then g Ad 1 (7i:d-i) is the second degree Taylor approximation to the marginal distribu- 
tion §A 1 . d _ 1 (7i:rf-i)- For a proof see Appendix 10. 
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If we recursively compute , . . . , q Ai , we can derive approximate conditional proba- 
bilities using (7). Precisely, we have 

?Ai(7i = 1 I 7i:i-i) = ^{c t + 6T 7 i :i _i), (16) 

where £~ 1 (x) = (1 + exp(— and Cj, bi are parts of the matrix Aj according to the 
notation introduced in (15). In particular, (16) is a logistic regression. We come back 
to this class of families in the following Section 6. We can sample from the proxy 

QAh) ■= UieD ?A,(7i I 7i:i-i) « C PQU (7), 

which is close to the original exponential quadratic family. The goodness of the approx- 
imation might be improved by judicious permutation of the components. The approx- 
imation error is hard to control, however, since we repeatedly apply the second degree 
approximation and propagate initial errors. 

5.3 Fitting the parameter 

As in section 4.4, we use the bijection 

r: D x D {1, . . . , d(d + l)/2} , r(i,j) = i(i - l)/2 + j 

to map symmetric matrices into ]R( ci + 1 ) d / 2 . Precisely, for the matrices A and X, we 
define the vectors 

We let yk = \ogir{xk) for k = 1, . . . , n and fit the family solving the least square problem 



mm aGK( d + 1 ) d / 2 

which yields the parameters 



Xa - y 



Note that in most adaptive Monte Carlo algorithms that involve importance sampling 
or Markov transitions, the probabilities ^(x^) of the target distribution are already 
computed such that the fitting procedure is rather fast. 

5.4 Properties 

We check the requirement list from Section 1.4: 

(a) The log-linear family is sufficiently parsimonious with dim(#) = d(d+ l)/2. 

(b) We can fit the parameter A via minimum least squares. 

(c) We can sample from an approximation q^(y) ~ <7a(t) to the log-linear family. 
However, we cannot control the approximation error. 

(d) We can evaluate qA,a (y) up to the normalization constant n which suffices for most 
adaptive Monte Carlo methods. 

(e) The family qA,a reproduces the mean and correlations of the data X. 
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6 The logistic conditionals family 

In the previous section we saw that even for a rather simple non-linear family we can- 
not derive closed-form expressions for the marginal probabilities. Therefore, instead 
of computing the marginals for a d-dimensional family ge('y), we directly fit univariate 
families 

90 (7t = 1 I 7l:i-l)> * 6 D 

to the conditional probabilities 7r(7j = 1 | ji-.i-i) of the target function. Precisely, we 
postulate the logistic relation 

£(¥ n ( 7i = l)) = bi,i + E%\ Km > i £ d 
for the marginal probabilities where i is the logit function defined in (9). 

6.1 Definition 

For a (i-dimensional lower triangular matrix B, we define the logistic conditionals family 



as 



[fiihi + b J,l:i-l7l : i-l) - log (l + eMk,i + 6i,i:i_i7l:i-l 



= exp 

where (?p rod is the Bernoulli distribution and 



p(x) = = (1 + exp(-x))" 1 

the logistic function. We immediately identify the product family q^ d as the special 
case B = diag [^(tm)] . The logistic conditionals family is not in the exponential family. 

Note that there are d\ possible logistic families and we arbitrarily pick one while there 
should be a permutation cr(D) of the components which is optimal in a sense of nearness 
to the data. In practice, however, changing the parametrization does not seem to have 
a noticeably impact on the quality of the adaptive Monte Carlo algorithm. 



6.2 Sparse logistic regressions 

The major drawback of all multiplicative families is the fact that they do not have closed- 
form likelihood-maximizers such that the parameter estimation requires costly iterative 
fitting procedures. Therefore, we construct a sparse version of the logistic regression 
family which we can estimate faster than the saturated family. 

Instead of fitting the parameter of the saturated family q\° sCo {^i \ 7i : j-i), we preferably 
work with a more parsimonious regression family like f/5° sCo (7i | 7lJ for some index set 
Li Q {1, ■ ■ ■ ,i — 1}, where the number of predictors is typically smaller than i — 1. 

We solve this nested variable selection problem using some simple, fast to compute 
criterion. For e about we define the index set 

I := {i = 1, . . . ,d \ Xi £ (e, 1 — e) }. 

which identifies the components which have, according to the data, a marginal proba- 
bility close to either boundary of the unit interval. 
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We do not fit a logistic regression for the components i £ I. We rather set Lj = and 
draw them independently, that is we set = £(xi) and b^-j = which corresponds 
to logistic conditionals family without predictors. The reason is twofold. Firstly, in- 
teractions do not really matter if the marginal probability is excessively small or large. 
Secondly, these components are prone to cause complete separation in the data or might 
even be constant. 

For the conditional distribution of the remaining components I c = D\I, we construct 
parsimonious logistic regressions. For 6 about j^, we define the predictor sets 

Li := {j = 1, . . . ,i - 1 | S < \nj\}, i e r, 

which identifies the components with index smaller than i and significant mutual asso- 
ciation. 



6.3 Fitting the parameter 

Given a sample X = (sci, . . . , x n ) J ~ ir from the target distribution we regress = Xj 
on the columns Z^) = (Xi : i_i, 1), where the column Zf' yields the intercept to complete 
the logistic conditionals family. 

We maximize the log-likelihood function £{b) = £(b \ y, Z) of a weighted logistic 
regression family by solving the first order condition d£/df3 = 0. We find a numerical 
solution via Newton-Raphson iterations 

_« (6(r+1 ,_, W) ^, r>0 , (18) 

starting at some b(°) ; see Procedure 2 for the exact terms. Other updating formulas like 
Iteratively Reweighted Least Squares or quasi-Newton iterations should work as well. 

Procedure 2 Fitting the weighted logistic regressions 
Input: w = (wi, . . . , w n ), X = (x 1 , ... , x n ) J , B <E R dxd 

for i € I c do 

Z<-(X Li ,l), y^X,, &(°)<-B i)iiU{i} 
repeat 

p k <- £- 1 (Z fc 6( r - 1 )) for all k = 1, . . . ,n 
q k <- p k (l - p k ) for all k = 1, . . . , n 

<- (Z T diag [to] diag [q] Z + el n ) _1 x 

(ZTdiag [w]) (diag [q] Z + (y - p)) 

until |^ r) - b { .~ 1] \ < 10" 3 for all j 

B i,LiU{i} <~ b 

end for 
return B 



Sometimes, the Newton-Raphson iterations do not converge because the likelihood 
function is monotone and thus has no finite maximizer. This problem is caused by data 
with complete or quasi-complete separation in the sample points (Albert and Anderson, 
1984). There are several ways to handle this issue. 
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(a) We just halt the algorithm after a fixed number of iterations and ignore the lack of 
convergence. Such proceeding, however, might cause uncontrolled numerical prob- 
lems. 



(b) Firth (1993) recommends the Jeffreys prior for its bias reduction but this option 
is computationally rather expensive. We might instead use a Gaussian prior with 
variance 1/e > which adds a quadratic penalty term eb J b to the log-likelihood to 
ensure the target-function is convex. 

(c) As we notice that some terms of bi are growing beyond a certain threshold, we move 
the component i from the set of components with associated logistic regression family 
I c to the set of independent components I. 

In practice, we recommend to combine the approaches (c) and (d). In Procedure 2, we 
did not elaborate how to handle non-convergence, but added a penalty term to the log- 
likelihood, which causes the extra el n in the Newton-Raphson update. Since we solve 
the update equation via Cholesky factorizations, adding a small term on the diagonal 
ensures that the matrix is indeed numerically decomposable. 

6.4 Properties 

We check the requirement list from Section 1.4: 

(a) The logistic regression family is sufficiently parsimonious with dim(#) = d(d+ l)/2. 

(b) We can fit the parameters bi via likelihood maximization for all i G D. The fitting 
is computationally intensive but feasible. 

(c) We can sample y ~ q^ sCo via chain rule factorization. 

(d) We can exactly evaluate q^ eCo (y)- 

(e) The family q^ sCo reproduces the dependency structure of the data X although we 
cannot explicitly compute the marginal probabilities. 

7 The Gaussian copula family 

In the preceding sections, we discussed three approaches based on generalized linear 
families. Now we turn to the second class of families we call copula families. 

Let hg be a family of auxiliary distributions on X and r : X — >■ M d a mapping into the 
binary state space. We can sample from the copula family 



by setting y = h(v) for a draw v ~ hg from the auxiliary distribution. 
7.1 Definition 

Apparently, non-normal parametric distributions sg with at most d(d— l)/2 dependence 
parameters either have a very limited dependence structure or rather unfavorable prop- 
erties (Joe, 1996). Therefore, the multivariate Gaussian distribution with 



lYil) = fr 



hg(v) dv 
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and mapping r : W 1 — >• M d 



r M («) = (l (00)W] (t;i),...,l (Vd)), 

appears to be the natural and almost the only option for hg. The Gaussian copula family, 
denoted by gjjjf , has already been discussed repeatedly in the literature (Emrich and 
Piedmonte, 1991; Leisch et al, 1998; Cox and Wermuth, 2002). 

7.2 Moments 

For I C. D, the cross- moment or marginal probabilities is 

mi = E 7e B^M,s(l/,7D\/) = L^ Bd {r^(i inDV )} hv{v)dv 

where we used (3) in the first line. Thus, the first and second moment of <?(/x,s) are 

mi = #i (//»), niij = $2(fJ>i, H'i a i,j) 

where and <&2{vi,Vj \ aij) denote the cumulative distribution functions of the uni- 

variate and bivariate normal distributions with zero mean, unit variance and correlation 
coefficient o"jj S [—1,1]. 

7.3 Sparse Gaussian copulas 

We can speed up the parameter estimation and improve the condition of S, if we work 
with a parsimonious Gaussian copula. We can apply the same criterion we already 
introduced for the sparse logistic regression family. For e about j^j, we define the index 
set 

I :={i = l,...,d\ Xi £ (e, 1 - e) }. 

which identifies the components which have a marginal probability close to either bound- 
ary of the unit interval. 

We do not fit a any correlation parameters for the components in / but set cjjj = for 
all j € D\{i}. Firstly, the correlation does not really matter if the marginal probability 
is excessively small or large. Secondly, we fit the parameter I] by separately adjusting the 
bivariate correlations crjj, and components with high correlations and extreme marginal 
probability lower the chance that S is positive definite. 

For the remaining components I c = D\I, we construct parsimonious Gaussian copula. 
For S about 4r, we define the association set 

A:={{i,j}el c xl c \5< \ nj \,i^j} 

which identifies the components with significant correlation. For i, j € D x D \ L we 
also set cjjj = to accelerate the estimation procedure. 
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7.4 Fitting the parameter 



We fit the family (fGauc^jj) to the data by adjusting /i. and X! to the sample moments. 
Precisely, we solve the equations 

$l(jH)=Xi, ieD (19) 

&2(fHifJ-j',<7i,j) = (iJ)tA ( 2 0) 

with sample mean Xj and Xij as defined in (1). We easily solve (19) by setting 

m = *i 1 (s i ) J i e D. 

The difficult task is computing a feasible correlation matrix from (20). Recall the stan- 
dard result (Johnson et al., 2002, p. 255) 

= K{yi,y 2 ), (21) 

where h a denotes the density of the bivariate normal distribution. We obtain the fol- 
lowing Newton-Raphson iteration 

a r+1 = a r — j r J -, y eA, (22) 

starting at some «o G (— 1, 1). We use a fast series approximation (Drezner and 
Wesolowsky, 1990; Divgi, 1979) to evaluate $2(^1 Mj5 a )- These approximations are 
critical when a r comes very close to either boundary of [—1,1]. The Newton iteration 
might repeatedly fail when restarted at the corresponding boundary ro G { — 1,1}. This 
is yet another reason why it is preferable to work with a sparse Gaussian copula. In any 
event, $2 (yi , 2/2 ; cr) is monotonic in a since (21), and we can switch to bi-sectional search 
if necessary. 

Procedure 3 Fitting the dependency matrix 
Input: Xi, Xij for all i,j G D 

Hi = for all i G D 

S = Id 

for (£, j) G A do 
repeat 

(r+ l) (r) &2(jii,Hj',<Tl r j)-Xij 



1,3 



until |ag -<t£ _1) I < 10 ~ 3 
end for 

if not £ ^ then S <- (S + |A| I d )/(1 + |A|) 
return /x, £ 



A rather discouraging shortcoming of the Gaussian copula family is that locally fitted 
correlation matrices £ might not be positive definite for d > 3. This is due to the fact 
that an elliptical copula, like the Gaussian, can only attain the bounds (5) for d < 3, 
but not for higher dimensions. 

We propose two ideas to obtain an approximate, but feasible parameter: 
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(1) We replace T, by E* = (S + |A|I)/(1 + |A|), where A is the smallest eigenvalue of 
the dependency matrix XI. This approach evenly lowers the local correlations to 
a feasible level and is easy to implement on standard software. Alas, we make an 
effort to estimate d(d — l)/2 dependency parameters, and in the end we might not 
get more than an product family. 

(2) We can compute the correlation matrix which minimizes the distance ||S* — 
where ||A||^ = tr [AA T ]. In other words, we construct the projection of S into the 
set of correlation matrices. Higham (2002) proposes an Alternating Projections 
algorithm to solve nearest-correlation matrix problems. Yet, if S is rather far from 
the set of correlation matrices, computing the projection is expensive and, according 
to our experience, leads to troublesome distortions in the correlation structure. 

7.5 Properties 

We check the requirement list from Section 1.4: 

(a) The Gaussian copula family is sufficiently parsimonious with dim(#) = d(d + l)/2. 

(b) We can fit the parameters n and X! via method of moments. The parameter S is 
not always be positive definite which might require additional effort it feasible. 

(c) We can sample y ~ q^ n ^ using y = t^(v) with v ~ hs- 

(d) We cannot evaluate q^ au0 (y) since this requires computing a high-dimensional inte- 
gral expression. 

(e) The family q^ u ^ reproduces the mean and correlation structure of the data X. 

Obviously, we cannot use the Gaussian copula family in the context of importance sam- 
pling or Markov chain Monte Carlo, since evaluation of q^ u ^(y) is not feasible. This 
family might be useful, however, in other adaptive Monte Carlo algorithms, for instance 
the Cross-Entropy method (Rubinstein, 1997) for combinatorial optimization. 

8 The Poisson reduction family 

Let iV = {l,...,n} denote another index set with n ^> d. Approaches to generating 
binary vectors that do not rely on the chain rule factorization (6) are usually based on 
combinations of independent random variables 

v = (vi, . . . ,v n ) ~ ®k&Nh 9k . 

We define index sets A4 = {Si € N j i G D} and generate the entry via 

Ti : ^ {0,1} , n(v) = f(v Si ), ieD. 

In the context of Gaussian copulas, the auxiliary distributions hg k = hg are d independent 
standard normal variables. Park et al. (1996) propose the following family based on sums 
of independent Poisson variables. 
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8.1 Definition 

We define a Poisson family of^x) with auxiliary distribution 

and mapping r : Nq — >• B d 

r s («) = (l{o}(Efce Sl ^)--- 1 {0}(Efeg5 d ^))- 

8.2 Moments 

For an index set I £ D, the cross- moments or marginal probabilities are 

mi = P (Vi G J: Efces, u fe = °) = ex P(" Efcen^Si A *)- 
Therefore, fitting via method of moments is possible. 
Proposition 8.1. For 7 E K d , define the index sets 

D = {i G £> 1 7i = 0} , Z?i = {i G r> 1 7i = 1} , 

and the families of subsets It = {I G D\ \ \I\ = t}. We can write the mass function of 
the Poisson family as 

<Z(s;A)(7) = £„ e .- 1(T ) h x (v) 



m Do 



For a proof see Appendix 10. 

8.3 Fitting the parameter 

We need to determine the family of index sets A4 and the Poisson parameters A = 
(Ai, . . . , A n ) such that the resulting family q(s,\) is optimal in terms of distance to the 
mean and correlation. Obviously, we face a rather difficult combinatorial problem. Park 
et al. (1996) describe a greedy algorithm, based on convolutions of Poisson variables, 
that finds at least some feasible combination of S and A. 

8.4 Properties 

We check the requirement list from Section 1.4: 

(a) The Poisson reduction family is not necessarily parsimonious. The number of pa- 
rameters dim(#) is determined by the fitting algorithm. 

(b) We fit the family via method of moments using a fast but non-optimal greedy algo- 
rithm. 

(c) We sample y ~ ifsx) usm S V = T s( v ) with v ~ h\. 

(d) We cannot evaluate 0^§\^(y) since it requires summation of 2 d ~' y — 1 terms using 
an inclusion-exclusion principle which is computationally not feasible. 



17 



(e) The family qJ§\-) can partially reproduce the mean and certain correlation structures 
of the data X. We cannot sample negative correlations. 

Since the family is limited to certain patterns of non-negative correlations, we cannot 
use it as general-purpose family in adaptive Monte Carlo algorithms. It might be useful, 
however, if we know that the target distribution tt has strictly non-negative correlations. 

9 The Archimedean copula family 

Genest and Neslehova (2007) discuss in detail the potentials and pitfalls of applying 
copula theory, which is well developed for bivariate, continuous random variables, to 
multivariate discrete distribution. Yet, there have been earlier attempts to sample binary 
vectors via copulas: Lee (1993) describes how to construct an Archimedean copula, more 
precisely the Frank family, (see e.g. Nelsen (2006, p. 119)), for sampling multivariate 
binary data. 

Unfortunately, most results in copula theory do not easily extend to high dimensions. 
Indeed, we need to solve a non-linear equation for each component when generating a 
random vector from the Frank copula, and Lee (1993) acknowledges that this is only 
applicable for d < 3. For low-dimensional problems, however, we can just enumerate the 
solution space M d and draw from an alias table (Walker, 1977), which somewhat renders 
the Archimedean copula approach an interesting exercise, but without much practical 
value in Monte Carlo applications. 
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10 Appendix 



Proof Proposition 4..1. Recall that X = 2 D and vi(j) = YlitiKli ~ m i) I \J m %(X ~ m i)] 
with rrii > for all i G D. We define an inner product 



(f,g) := E g P Iod (/(7)5(7)) = Z-ym* f (l)9(l) ILezWa 



Tin -m^ 1 -^ 



on the vector space of real- valued functions on M d . The set S = {^7(7) \ I & T} is 
orthonormal, since 

/ ^ n ip ( (a- m i) 2 \ tt w ( n-m \ 

(vj,rj) = || E g p rod — -— — || E„ Prod 



4£ /nj V mi(1 - mj V Je( /uJ)\(/nJ) q ^\V^-mi)J 

Jo for / ^ J 
~ [1 for/=J, 

There are 2 rf — 1 elements in S and 9^, od (7) > which implies that 5U{1} is an orthonor- 
mal basis of the real- valued function on M d . It follows that each function / : M d — >• M 
has exactly one representation as linear combination of functions in S U {1} which is 

/ = (/,!) +12iez v i(f> v i)- Since 

(tt/CV/) = E 7G B4^(7)/C)( 7 )^(7)^ od (7) = ^ (vi(l)) = ci, 

we obtain Tr(l)/qm° d (l) = 1 + J2iei v i(l) °i f° r / = 7T /lm° d which concludes the 
proof. □ 

Proof Proposition 4. 2. We first derive two auxiliary results to structure the proof. 

Lemma 1. For a set /CDof indices it holds that 

Yl Ik = 2 d -l / l- 2+1 'W +1 ^wO'). 

7GB d k£lU{i,j} 

For an index set M C D, we have the sum formula X^ 7 GB d FLeAf Ik = 2 d_ l M L If we have 
an empty set M = the sum equals 2 d and each time we add a new index i 6 D\M 
to M half of the addends vanish. The number of elements in M = I U {i,j} is the 
number of elements in / plus one if i ^ / and again plus one if i 7^ j and j ^ /. 
Written using indicator function, we have |/U {i, j}\ = \I\ + l D \j(i) + ~&D\(lu{i})(j) = 
\I\ + 2 — l/(i) — l/u{«}(i) which implies Lemma 1. □ 



Lemma 2. 



J2 2 1 '«+i/u { <>Cj) ay = 1TA1 + tr [A] + £ 

ieDjeD iei 



j&D jei\{i) 



Straightforward calculations: 

2 M0+iiu W Ci) = (i + + i /u{i} (j)) 

= (1 + lj(i))(l + l/(j) + " l/n»(j')) 

= 1 + + l/(j) + l/(i)l/(j) 

+ l {i} (i) + l/(*)l W (j) - t In{i} (j) - tj(i)l In{i} (j) 
= 1 + + l/(i) + lj(j) + l/ x /(i,i) - l/n{i}(i), 
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where we used the identity 

= li(i)li{i)t {i} U) = l/(z)l/(j)l { i}0') = l/«l/n{i}(i) 
in the second line. Thus, we have 

ieDjeD 

= YY( 1 + 1 W0') + + + ^x/^i) - l/n{i}(j)) OiJ 
ieDjeD 

= Y Y a ^ + Z a ^ + Z Z a ^ + Z Y ai >j + YY a ^ - Y a o>. 

ieDjeD jeD iei jeD ieD jei iei jei 



iei 



VA1 + tr [A] + 



kei 



VA1 + tr [A] + Y 

kei 



ieD is/ 



/G-D ZgAM 



The last line is the assertion of Lemma 2. 

Using the two Lemmata above, we find a convenient expression for the cross-moment 

mi= Y (II 7fc ) + 7 T A7) 
fee/ 



□ 



Y a ° + Y (II 7fc ) Y Y a 

kei ieDjeD 



1 -j 



ieDjeD -yeM d keiu{i,j} 



(Lemma 1) 



2<H'i ao + Y J Y 2d ~ lIU{i,j}l 

ieD jeD 



fi2 



d-\I\-2 



4a + EE 211 



(i)+liu{i}0') 



ieDjeD 



(Lemma 2) 



^2 



d-|7|-2 



4a + 1 T Al + tr [A] + ^ 



iei 



2 Y ai 'i + Y ' 
jeD jei\{i} 



1 



Since mj = 1 by definition, we the normalizing constant is 

/i = 2~ d+2 (4a + 1 T A1 + tr [A])" 1 , 
which allows us to write down the normalized cross-moments 

1 zZiel 2 J2jeD a i,j + J2jei\{i} a iJ 

mi = w\ + - 



2l 7 l(4a + UAl + tr [A]) 



The proof is complete. 



□ 
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Proof Proposition 4-4- We margin out the last component d. Let I = {1, . . . ,d — t}, 

Co^^" 1 = (ff&ofr. !) + 95foo(7/,0)) ^ 

= 2a + (7/, 1) T A( 7/ , 1) + (7/, 0) T A( 7/ , 0) 
= 2a + tr [A [( 7J , 1)( 7/ , 1)T + ( 7/ , 0)( 7 /, 0)^]] 

"27/7/ li 



2oq + tr 



1} 



Iterating the argument, we obtain for I = {1, ... ,d — t} and I c = D\I 



ff2£W- 1 = 2 t «o + 2*~ 2 tr 



4 7j7 J 2 7/lJ 

2it 7 ; i t ij + it 



Straightforward calculations: 



tr 



4 7 /7/ T 2 7/1 J 
21 t7/ T ltlj+lt. 
tr [A [(2 7 /, 1^(27/, 1*) T + diag [0/, l t ]]] 
[(2 7/ , l t ) T A(2 7 /, l t ) + tr [Adiag [0/, l t )]]] 



iei je/ 



is/ je/ c 



iei c jei c 



i€l c 



ie/ j'£/ ie/ c ie/ c ie/ c ie/ c 



The proof is complete. 



□ 



Proof Proposition 5.1. For convenience of notation, let 7_ = (71, . . . ,7^—1). Note that 
(7) — //exp(7lA'7_ + 7^(26 T 7_ + c)). The marginal distribution is therefore 

vr(7_) = // exp(7lA'7_) (l + exp(27l& + c)) 

= A* exp (7IAV + 7I6 + (exp(-7l6 - |) + exp(7lb + |) 
= exp ( 7 IA'7_ + 7I6 + |) 2 cosh (7I& + |) . 
The marginal log mass function is thus 

log 7r(7_) = log(2/x) + I + 7I A ; 7_ + 7I& + log cosh (7I& + . 
For log cosh we can use a Taylor approximation 

logcosh(7lbH — ) « logcosh(-) +7!^ tanh(-) H — (7I&) 2 sech 2 (-) 
2 2 2 2 2 



to obtain 



c c 

logvr(7_) « log(2/icosh(-)) + - +7IA7- 

+ (1 + tanh(|)) 7 l6 + 1 sech 2 (|)( 7 Ifo) 2 
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Since 7_ is a binary vector, we have 7I& = 7ldiag [b] 7_ and can thus rewrite the inner 
products as 

7I A'7_ + 7I6 + {jib) 2 = tr [A'7_7l + diag [b] 7^7! + 6b T 7_7l] 

= 7l(A' + diag [b] + 6b T ) 7 _. 

We let denote 

c c c c c 

H* = 2^cosh(-)exp(-) = //(exp(--) + exp(-)) exp(-) = fi(l + exp(c)) 



and 



to form the approximation vr(7_) ~ fi* exp(7lA*7_) which completes the proof. 



A* = A' + ( 1 + tanh(-) ) diag [b] + - sech 2 (-)5b T 



□ 



Proof Proposition 8.1. Straightforward calculations using an inclusion-exclusion argu- 
ment for the union of events: 

Q(s,\)(l) = E hx ^ = Fh * i 1 ^} ^fce5, v k = 7i}) 

i>£t _1 (7) 

= F hx (n ieDl n k£Si H = 0} , n ieDo u keSi {v k > o}) 
= ¥ hx (n ieDl n k&Sl {v k = o})P h , A (n ieDo u k€Si \ UjeDiSj {v k > o}) 
= (7Di = 1) (1 - n x (u ieA) n fce5AUje0iSj H = o} N 



m Do 



m Do 



t=l 
\Do\ 



ICXt 



1 - E(" 1 ) <_1 E p ^ (n i6 / n fceSAu , 6DiS . K = 0} 



E 



-Afe 



t=l 



ICXt \ker\ ieI Si\Uj eDl Sj 



The proof is complete. 



□ 
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